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Abstract 

Graphical models are increasingly used in genetic analyses to take into account the complex relationships between 
genetic and nongenetic factors influencing the phenotypes. We propose a model for determining the network 
structure of quantitative traits while accounting for the correlated nature of the family-based samples using the 
kinship coefficient. The Gaussian graphical model of age, systolic blood pressure, diastolic blood pressure, 
hypertension, blood pressure medication use, and smoking status was derived for three time points using real data. 
We also explored binary sparse graphical models of single-nucleotide polymorphisms (SNPs), covariates, and 
quantitative traits for exploratory analysis of the data. We validated the applicability of this method by producing a 
network graph using 20 causal variants, 21 noncausal variants, and 6 binary and quantitative phenotypes using the 
simulated data. To improve the model's ability to identify associations between the causal variants and the 
phenotypes, we intend to conduct follow-up studies investigating how to use the relationships between SNPs and 
between SNPs and phenotypes when analyzing genome wide association data with multiple phenotypes. 



Background 

Graphical models are popular methods for exploratory 
data analysis [1]. Understanding the network structure of 
various genetic and nongenetic factors affecting pheno- 
types is gaining importance because more methods are 
developed that are capable of using such information. 
Analyzing high-dimensional data is a challenging task. 
Subset selection is a useful tool to remove noise in the 
data. LASSO-based methods [2] are very useful for reli- 
able analysis of high-dimensional data. Here, we propose 
a model for determining the network structure of quanti- 
tative traits while accounting for the correlated nature of 
the family-based samples using the kinship coefficient. 
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Methods 

Gaussian graphical models for quantitative traits in 
pedigrees 

Suppose Y is a n X p data matrix containing n indivi- 
duals and p quantitative traits (e.g., systolic blood pres- 
sure [SBP] and diastolic blood pressure [DBP]). The 
individuals are correlated because they were sampled 
from pedigrees. The correlation among the samples 
attributed to shared genetics and environment can be 
modeled using the kinship coefficient [3]. The kinship 
coefficient determines the genetic similarity or related- 
ness between 2 individuals within a pedigree. We intend 
to estimate the dependence structure between the traits 
while accounting for the correlation in the samples. 

The data can be modeled using matrix normal distri- 
bution [4], which models the dependency structures; 
the dependency between the samples; and the depen- 
dency between the traits, Y„y,p ~ MatrixNormai(o,„p,n'i;^p,R„,,„y 
where Onxp is the standardized mean, Q is the inverse of 
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the covariance of the p quantitative traits, and R is the 
covariance matrix of the « related samples. Because the 
individuals from different pedigrees are independent and 
the correlation exists only within a pedigree, is a 
block diagonal matrix, with each block corresponding to 
a pedigree. Within each pedigree, the correlation coeffi- 
cients between the samples are specified as twice the 
value of the kinship coefficient. Using reported heritabil- 
ity values of SBP and DBP and using phenotypic corre- 
lation between monozygotic twins [5], we estimated the 
shared environmental component to be approximately 
0.1. Therefore, a constant c = 0.1 was added to the cor- 
relation of samples within a pedigree to account for the 

shared environmental factors. The likelihood for the 

n p 

model is: Lifee/iTiood a det(Q)2 det (R)2exp[-itrace(QY^R-iY)]- 
The log likelihood can be written as 

n 1 T -1 

logJikelihood oc —logdet{Q) — —trace{QY R Y). 

Many methods are available to optimize Q over the 
model's likelihood function. But to identify the conditional 
independence structure of the p traits, we have to find 
traits that are conditionally independent given all other 
traits. This information is contained in the inverse of the 
correlation matrix Q between the traits. If ^ij = 0, trait i 
and trait j are conditionally independent, given all other 
traits. To enforce sparsity in the estimation of fi, 
the LASSO penalty can be imposed on the likelihood. 
The resulting penalized log likelihood is 

logJikelihood oc ^logdet[n) - ^£race(i2Y^R"^y) - p ||S2||, 

where ||J2|| is the LI norm of Q.. The LASSO penalty esti- 
mates the sparse graphical model [6] of dependence 
between the quantitative traits by forcing the value of non- 
significant elements in the inverse of the covariance matrix 
to zero. At each time point, the precision matrix between 
the phenotypes was estimated using graphical lasso, pena- 
lization on the graphical model likelihood using the R 
package "glasso" [6]. Because the correlation among indivi- 
duals within a family is not accounted for in the standard 
glasso, we modified this package to account for such 
relationships using the kinship coefficient. The kinship 
coefficient for individuals within the pedigrees was esti- 
mated using the SimWalk2 program [7]. 

Sparse graphical models for binary and quantitative traits 

The proposed model is not suitable for exploring the 
association between single-nucleotide polymorphisms 
(SNPs) and quantitative traits such as SBP and DBP. In 
the case of continuous traits, conditional independence 
between nodes is directly estimated using the partial 
correlation coefficients, which are related to the inverse 
of the correlation matrix. This interpretation of condi- 
tional independence is not possible, however, when the 



traits are discrete. The assumptions required for Gaus- 
sian graphical models are not satisfied for discrete phe- 
notypes such as hypertension and smoking status or for 
SNP genotypes. Some of the popular approaches for 
estimating graphical networks of discrete variables are 
based on the LASSO regression [6,8]. 

Assume Y is a « x pdata matrix containing n indivi- 
duals and p quantitative and discrete traits. The p vari- 
ables, or nodes, of the network contain SNPs, 
quantitative phenotypes such as age, SBP and DBP, and 
binary phenotypes such as hypertension, blood pressure 
medication, and smoking status. Each of the variables (e. 
g., SBP, DBP, hypertension, age, SNPs) is considered as 
a response and is regressed on all of the other variables, 
which are considered as predictors. The predictors asso- 
ciated with the response variable are considered to be in 
the neighborhood of a particular variable. After comput- 
ing the neighborhoods for all the variables, an AND 
operator or an OR operator is used to determine the 
conditional independence of 2 traits i, j. (i.e., if i is in the 
neighborhood of j AND/OR ; is in the neighborhood of 
i, they are conditionally dependent, given all other vari- 
ables). The strength of the dependence can be measured 
by taking the maximum, minimum, or average of the 2 
neighborhood measures between We performed 
LASSO regression for all the variables based on a cross- 
validated penalty parameter to estimate the sparse 
shrinkage coefficients. 

Results 

Data 

Using the Gaussian graphical model for pedigrees and 
sparse graphical models for discrete and quantitative 
traits, we analyzed Genetic Analysis Workshop 18 
(GAW18) data, which includes genome-wide association 
data for 400,000 SNPs, along with simulated and real 
phenotypic information SBP, DBP, hypertension, blood 
pressure medication use, and smoking status. The real 
data contained 939 individuals within 20 pedigrees at 4 
time points. Missing data were present at all the time 
points. We excluded individuals with missing data for 
each of the time points and performed our analyses on 
the remaining data. For the analysis of data using the 
Gaussian graphical model for quantitative traits in pedi- 
grees, we analyzed the first 3 time points for the 6 phe- 
notypes in the real data. The fourth time point was 
excluded from the analysis because most of the data was 
missing for this time point. 

For the sparse graphical model with discrete and con- 
tinuous traits, we concentrated on chromosome 3. We 
used genome-wide association data for constructing the 
network. Two hundred replicates of simulated data for 
the 3 time points were available that were generated 
using the real pedigree structures. We used a single 
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replicate of the simulated data for the phenotypes. Only 
the unrelated individuals from the first time point were 
used for this analysis. In the simulated data, a total of 
1457 genetic variants were causal for either SBP or DBP 
across all the chromosomes. Of these 1457 causal var- 
iants, 188 variants were located on chromosome 3. We 
randomly sampled 20 of these 188 variants on chromo- 
some 3 in our analysis. 

Gaussian graphical models for pedigrees 

We derived the graphical models of 6 traits and covari- 
ates, accounting for pedigree structure: age, SBP, DBP, 
hypertension, blood pressure medication use, and smok- 
ing status. Because hypertension, blood pressure medica- 
tion use, and smoking status are discrete phenotypes, we 
transformed these variables into quantitative phenotypes 
using a logistic regression framework in which all the 
other phenotypes were regressed as dependent variables 
in the logistic model. At each time point, the graph 
shows the conditional relationships among the pheno- 
types. For example, in Figure 1, the graph for the second 



time point shows that age and DBP are conditionally 
negatively correlated given all the other phenotypes. The 
weight of the edge is the partial correlation between age 
and DBP, which was -0.2042. Similarly, the other edges 
point out the conditional relationships among the other 
phenotypes. The graph structure remained essentially 
the same for all 3 time points. Smoking status was not 
related to any of the other phenotypes at the 3 time 
points. Whereas DBP was inversely correlated with age, 
SBP was positively correlated with age. 

Sparse graphical models for binary and quantitative traits 

We validated the sparse graphical methodology using 
the simulated genome-wide association data. Twenty of 
the 188 causal SNPs on chromosome 3 were randomly 
sampled. We also analyzed 21 consecutive noncausal 
SNPs from the same chromosome. The causal and non- 
causal SNPs analyzed are detailed in Figure 2. The gra- 
phical model also included the 6 phenotypes studied in 
the Gaussian model (age, SBP, DBP, hypertension, blood 
pressure medication use, and smoking status) for the 





Figure 1 Relationships among various traits accounting for correlation within and among pedigrees Shown are the relationships among 
the phenotypes in terms of a Gaussian graphical model. The edges correspond to the conditional relationship between 2 phenotypes, given all 
other phenotypes after accounting for the family structure. The 3 graphs represent sparse graphical models for each of the 3 time points. 
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first time point. Thus, our sparse graphical network 
model used 6 phenotypes, 20 causal SNPs, and 21 non- 
causal SNPs. 

We performed LASSO regression using all 47 genetic 
and nongenetic factors and constructed the graph as 
described in the methods section. We used the AND 
operator for the conditional independence of 2 nodes to 
get a sparser graph. The strength of dependence was 
measured using the maximum measure of the 2 regres- 
sion coefficients. Figure 2 shows the sparse graphical 
network of the phenotypes and the causal and noncausal 
SNPs. The phenotypes are coded in red, the causal SNPs 



in pink, and the noncausal SNPs in green. The 21 non- 
causal SNPs are in linkage disequilibrium (LD) with 
each other because of their proximity, which explains 
the huge number of edges between them. The network 
shows that the causal SNPs are linked with different 
phenotypes, but the noncausal SNPs are not linked to 
the phenotypes. However, 2 noncausal SNPs (rsll59106, 
rs4684741) were associated with the phenotypes. This 
can be explained by the noncausal SNPs being in low 
LD with 2 causal SNPs (rsll711953 and rs3772985, 
respectively), as shown by the blue edges in Figure 2. 
The values were 0.049 and 0.043, respectively. All of 
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Figure 2 Single-nucleotide polymorphism (SNP) networit corresponding to 41 SNPs and 6 traits. This figure shows the graphical model 
for 20 causal SNPs (pink, 0-19), 21 noncausal SNPs (green, 20-40), and 6 phenotypes (red, 41-46). The 6 phenotypes are age (41), hypertension 
(42), systolic blood pressure (SBP) (43), diastolic blood pressure (DBP) (44), blood pressure (BP) medication use (45), and smoking status (46). The 
red edges show the associations with the phenotypes. The blue edges show the linkage disequilibrium (LD) between the causal and noncausa 
variants. The gray edges show the LD among the noncausal variants. 
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the phenotypes were interrelated, except for smoking 
status, which was independent of the other phenotypes 
and any genetic variants. 

We also conducted additional validation of the pro- 
posed method where we randomly selected 21 noncausal 
SNPs from chromosome 3 that were not in LD with any 
of the causal SNPs or among themselves. All of the 
other phenotypes and the causal SNPs were as in the 
previous scenario. As expected, the resulting sparse gra- 
phical network (not shown) had no edges among the 
noncausal variants, and there were no edges connecting 
the causal variants and noncausal variants. The part of 
the network corresponding to the phenotypes and the 
causal SNPs was similar to the previous scenario. 

Discussion and conclusions 

Graphical models provide an intuitive and straightfor- 
ward way to visualize and use complex relationships 
among data. These models have mainly been used for 
analyzing case-control data among unrelated individuals. 
Here we have proposed a straightforward graphical 
method of accounting for correlation in pedigrees that 
can be used for decorrelating family data or, in general, 
for decorrelating correlated samples. If one is analyzing 
family data and needs to use a methodology that is sui- 
table for case-control data with unrelated individuals, 
the data must first be decorrelated. In such cases, we 
can use the correlation structure identified from such 
graphical models as the variance-covariance matrix for 
the phenotypes. In this paper, we used c, induced corre- 
lation due to the shared environment, to be equal to 
0.1. However, we have found that the method is robust 
to slight departures from the true value of c. The pro- 
posed model can be directly incorporated as a hierarchy 
into a Bayesian hierarchical model for simultaneously 
analyzing the phenotypes while taking into account the 
correlation among the family members. 

We also explored a sparse network model that con- 
structs an intuitive network graph including SNPs and dis- 
crete or continuous phenotypes. The network structure 
with the genetic and nongenetic factors is not perfect (in 
the sense that all the causal variants did not have links to 
the phenotypes). This may be because of the small sample 
size of the study. This raises an important question of 
whether we need to increase the number of pedigrees or 
the number of individuals within a pedigree. It is also 
important to note that one cannot assign a statistical sig- 
nificance (e.g., /j-value) to the sparse graphical network as 
it is a data-driven structure. In the future, we intend to 
conduct follow-up studies investigating how to use the 
relationships between SNPs and between SNPs and phe- 
notypes when analyzing genome-wide association data 
with multiple phenotypes. 
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